PSYC 3032 M
Module 2’s topics relate to modelling (linear) relationships between variables to help address interesting questions like…
How does education affect earnings?
To what extent does listening to ska music relate to dressing in black-and-white checkered clothing?
How strong is the relationship between taking PSYC3032 and being a billionaire?
What is the association between OCD and depression?
Does exercise influence psychological brain states, such as depression or anxiety?
Correlation and Simple Linear Regression
Before we dive into each topic separately, it’s useful to put both in the right context.
Correlation and simple regression are often used interchangeably, but there are key conceptual (though not mathematical) differences between them.
Correlation describes the strength of the (primarily linear) relationship or association between two variables
Correlation is used mainly as a descriptive statistic, to quantify an association, but NOT saying anything about causation
Though, as it turns out, the math required to obtain correlation estimates uses the same information as… you guessed it, simple linear regression!
Note that correlation is cause-blind (association\(\neq\)causation); we often graph the relationship with double-headed arrows (i.e., we don’t know/care about why they relate, we just know they vary together)
Regression models, on the other hand, are necessarily directional (one-headed arrow), meaning we make a statement/assumption about what causes/affects what (e.g., X2 leads to Y2)
Covariance and Correlation
Before we define correlation, which is, in fact, a standardized effect size measure, we should first talk about covariance, the unstandardized sibling of correlation
By now you should know that each variable has its own variance—which describes the spread of the individual observations on that particular variable
\[VAR(X)=\frac{\sum (x_i-\bar{x})^2}{N-1}= \frac{\sum (x_i-\bar{x})(x_i-\bar{x})}{N-1} \]
where \(x_i\) is a particular observation’s score on X, \(\bar{x}\) is the mean of X, and \(N\) is the sample size.
The numerator represents the sum of squares (i.e., the sum of squared deviation scores from the mean)
\[COV(X, \ Y)= \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{N-1}\]
This measure describes how much the variables co-vary together; the covariance gives us a measure of how these two variables \(X\) and \(Y\) are associated
If Y tends (i.e., on average) to be above its mean when X is above its respective mean, then \(COV(X, Y)\) is positive; if Y tends to be above its mean when X is below its respective mean, then \(COV(X, Y)\) is negative; when \(COV(X, Y) = 0\), we say that X and Y are uncorrelated, or orthogonal to one another (zero covariance rules out a linear association, which is weaker than full independence)
Covariance is an important statistic, but because it’s in a hybrid metric (the product of the units of X with the units of Y), it’s hard to gauge its magnitude
With this definition of covariance we can now define Pearson’s correlation parameter
\[\rho = \frac{COV(X,Y)}{SD(X) \cdot SD(Y)}\]
where \(SD(X)\) and \(SD(Y)\) are the standard deviation of X and Y, respectively.
Dividing by the standard deviations of X and Y removes both metrics, thereby standardizing the covariance and putting it in a comprehensible metric
Correlation is, thus, the standardized version of covariance
A correlation coefficient is a single numeric value representing the degree to which two variables are associated with one another
Because correlation is a standardized effect size measure, correlation coefficients are bounded by –1 and +1
The sign indicates the direction of the association, while the magnitude of the measure indicates the strength of the association
\(|r| = 1\) = perfect relationship; \(r = 0\) = no (linear) relationship
The correlation formula above is for the Pearson Product-Moment Correlation Coefficient between two continuous variables, but there are others (which we don’t discuss)
Neither \(\rho\) nor its estimate \(r\) provides a complete description of the two variables; you should always provide means and standard deviations as well.
Correlation measures the strength of the linear relationship between x and y only; it’s inappropriate to use a correlation to describe nonlinear relationships
Pearson’s correlation assumes that both the variables are normally distributed and that the spread of the scores on one variable is constant across the other
We can check all of that, and it’s ALWAYS a good idea to visualize the association (recall the first step in Modeling Steps 👣?)
How would you describe the relationship between age and height? Can you guess the correlation coefficient?
We could do this ourselves (first block), or let R do this for us!
[1] 0.7824843
which is the same as…
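The code blocks from the slide aren’t reproduced here, but the “do it ourselves” route is easy to sketch. With simulated stand-ins for age and height (hypothetical values, not the slide’s data), the hand computation and R’s built-in `cor()` agree exactly:

```r
# Hypothetical stand-ins for age and height (NOT the slide's data)
set.seed(3032)
age    <- runif(100, min = 5, max = 18)
height <- 80 + 5 * age + rnorm(100, sd = 12)

# "First block": covariance by hand, then divide by the product of the SDs
n      <- length(age)
cv     <- sum((age - mean(age)) * (height - mean(height))) / (n - 1)
r_hand <- cv / (sd(age) * sd(height))

# Or let R do it for us
r_R <- cor(age, height)

all.equal(r_hand, r_R)  # TRUE
```

The same equivalence holds for any two numeric vectors, since `cor()` is just the standardized covariance.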
A correlation coefficient is an effect size
It describes the magnitude and direction of the effect (association between two variables)
According to conventional benchmarks (based on Cohen’s rules of thumb): \(|r| \approx .10\) is a small effect, \(|r| \approx .30\) is medium, and \(|r| \approx .50\) is large
Another way to express the magnitude of the effect is to square the correlation to get the coefficient of determination, \(r^2\)
The coefficient of determination provides the proportion of variance in one variable that is shared or accounted for by the other
Partial and semi-partial correlations are used when we are interested in the relationship between two variables while controlling for the effects of a third variable
Partial-\(r^2\) captures the relationship between X and Y, controlling for the effect that a third variable (Z) has on both X and Y.
Semi-partial-\(r^2\) captures the relationship between X and Y, controlling for the effect that Z has on only one of them, typically the predictor X (remember this for multiple regression!)
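A residual-based sketch can make “controlling for” concrete. The variables below are simulated (names hypothetical), and the partial and semi-partial correlations are cross-checked against their closed-form formulas; note the semi-partial here residualizes the predictor X, which is one common convention:

```r
set.seed(1)
z <- rnorm(200)                  # the "third variable"
x <- 0.5 * z + rnorm(200)
y <- 0.4 * z + 0.3 * x + rnorm(200)

# Partial r: remove z from BOTH x and y, then correlate the residuals
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# Semi-partial r: remove z from only ONE variable (here, the predictor x)
r_semi <- cor(y, resid(lm(x ~ z)))

# Cross-check against the textbook formulas
rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
all.equal(r_partial, (rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2)))  # TRUE
all.equal(r_semi,    (rxy - rxz * ryz) / sqrt(1 - rxz^2))                  # TRUE
```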
Returning to our previous example of age and height
When adding a third variable, weight:
height age weight
height 1.0000000 0.4201188 0.1302424
age 0.3757845 1.0000000 0.3018020
weight 0.1384063 0.3585575 1.0000000
Semi-partial-\(r = 0.38\) means that when “controlling” for weight (or holding weight constant), the unique correlation between age and height is about .38
Squaring it, about 14% (\(0.38^2 \approx 0.14\)) of the total variability in height is shared uniquely with age
Say we’re interested in testing whether two random variables X and Y are correlated; we test the null
\[H_0: \rho = 0\]
using the t ratio:
\[t(df)=\frac{r-\rho_0}{SE_r}=\frac{r-0}{\sqrt{\frac{1-r^2}{N-2}}}=\frac{r\sqrt{N-2}}{\sqrt{1-r^2}}\]
with \(df = N − 2\) degrees of freedom, where r is the observed correlation, \(\rho_0\) is the specified correlation value under the null (i.e., 0), and N is the sample size
Rejecting this null would indicate that the r you observed was “surprising” given the position of ignorance, \(\rho = 0\)
A significant result might not be very impressive on its own (e.g., with a large sample size, even a tiny r is significant), so we probably want to think in terms of CIs (what is \(\rho\) likely to be?)
A researcher collected data from 275 undergraduate participants and hypothesized a relationship between aggression and impulsivity. She measured aggression using the Buss Perry Aggression Questionnaire (BPAQ) and impulsivity using the Barratt Impulsivity Scale (BIS)
Load the data in R
Examine descriptive information about your variables, e.g., means, SDs, histograms/boxplots and scatterplots
Pearson's product-moment correlation
data: agrsn$BPAQ and agrsn$BIS
t = 5.5939, df = 273, p-value = 5.391e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2103747 0.4229210
sample estimates:
cor
0.3206789
[1] 0.102835
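As a sanity check, plugging the printed estimate back into the t formula from above reproduces the test statistic and the squared-correlation output (the values 0.3206789 and 275 come straight from the printout):

```r
r <- 0.3206789  # sample correlation from the cor.test output above
N <- 275

# t ratio for H0: rho = 0
t_stat <- r * sqrt(N - 2) / sqrt(1 - r^2)
round(t_stat, 4)  # 5.5939, as printed above

# Coefficient of determination
round(r^2, 6)     # 0.102835
```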
“In our sample of \(N = 275\) undergraduate students, there was a statistically significant relationship between scores on the BPAQ and BIS (\(r = 0.32\), 95% CI \([0.21, 0.42]\), \(t(273) = 5.59\), \(p < 0.001\)). These results suggest a moderate association between aggression and impulsivity. The narrow confidence interval (\([0.21, 0.42]\)) indicates a relatively precise estimate. Furthermore, aggression and impulsivity share about 10% of their variance (\(r^2 = 0.10\)), which is a modest yet meaningful amount.”
Simple Linear Regression
Mathematically, the GLM can be expressed like this:
\[y_i = \beta_0 + \beta_1 x1_{i} + \beta_2 x2_{i} + \dots + \beta_p xp_{i} + \epsilon_i\]
where \(y_i\) is the ith observation’s score on the outcome, \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_p\) are the slopes for the predictors \(x1_{i}, \dots, xp_{i}\), and \(\epsilon_i\) is the error term
To start, we’re going to focus on simple linear regression, meaning that we have one IV (predictor) and one DV (outcome)
Or, less angrily, the SLR looks like this:
\[y_i = \beta_0 + \beta_1 x1_{i} + \epsilon_i\]
A regression model is a formal model for expressing the tendency of the outcome variable, Y, to vary conditionally on the predictor variable, X.
This SLR model has 3 parameters and 3 variables; can you identify two of each?
Population:
\[y_i = \beta_0 + \beta_1 x1_{i} + \epsilon_i\]
Sample:
\[y_i = \hat{\beta}_0 + \hat{\beta}_1 x1_{i} + e_i\]
The linear regression model is designed to work with a continuous outcome variable
The residuals, \(e\), represent the inaccuracy of the model’s ability to reproduce (i.e., predict/explain) the value of \(y_i\) for a given person
\(\epsilon\) (and thus \(e\)) is assumed (for proper SE estimates, CIs, p values, etc.) to be normally distributed (across all levels of X) with a mean of 0 (the SD/variance is estimated)
\(Y\) and \(\epsilon\) (and thus \(e\)) are random variables; X is assumed to be an error-free (yeah right…) component, uncorrelated with \(\epsilon\), that we are using to predict values in Y
Parameters without the i subscript are constants—they do not vary across observations—\(\beta_0\) and \(\beta_1\) have one value for all observations/individuals
Regression models are all about conditional expectations (i.e., conditional means):
\[E(y_i|x1_i) = \beta_0 + \beta_1 x1_i\]
The intercept, \(\beta_0\), is the expected value on Y for a hypothetical observation with \(x1_i=0\)
The slope measures the strength of the linear relationship between X and Y and indicates that: a one unit increase on X results in an expected/predicted change of \(\beta_1\) in Y
Example: say that you were studying the relationship between a final grade in PSYC 3032 (in %) and the combined score on the assignments (out of a total of 50):
\[E( final| assignments) = \beta_0 + \beta_1 \times assignments\] \[E( final| assignments) = 42 + 1.16\times assignments\]
What this means: \(\beta_0 = 42\) indicates that a person who scored 0 on the assignments is expected/predicted to finish the course with 42%, while \(\beta_1 = 1.16\) means that every 1-point increase on the assignments is expected to increase the final grade by 1.16%
What about a specific individual; for example, someone who obtained 40/50? Notice below that the value of 40 is put into the equation, because \(\beta_1\) is with respect to the raw assignments score.
\[E( final| assignments) = \hat{y}_i= 42 + 1.16\times 40 = 88.4\]
In other words, that individual is expected to obtain 88.4% as their final grade, given their assignment scores (how close this average guess is depends on how good the model is at prediction!).
\(\hat{y_i}\) is called the predicted value for the ith observation; so, for that individual above the predicted/expected value, given their score on the assignment is, \(\hat{y_i}=88.4\).
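The arithmetic for that prediction is trivially checkable in R (the coefficients are the ones given in the example above):

```r
b0 <- 42    # intercept: predicted final grade with 0 assignment points
b1 <- 1.16  # slope: predicted change in final grade per assignment point

# Predicted final grade for a student who scored 40/50 on the assignments
y_hat <- b0 + b1 * 40
y_hat  # 88.4
```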
…Let’s look again at the SLR plot
The blue line, which is specified by our best \(\beta_0\) and \(\beta_1\) estimates, is called the “Line of Best Fit,” and it “cuts” right through all the observations. But how can we find it!?
OLS is a mathematical procedure for finding the “best” parameter estimates of a linear regression model.
But, what does “best” even mean?
“Best” parameter estimates means that the model (i.e., the regression line) gives us the most accurate prediction/explanation power! (in our sample)
As its name suggests, OLS gives us the values for the intercept and slope(s) that yield the minimum (least) squares (squared residuals) (i.e., “best!”).
We can square the residuals and find their sum. But, wait! We would need to know the estimates of \(\beta_0\) and \(\beta_1\) to get the residuals, right? Recall, \[e_i = y_i-\hat{y}_i=y_i-(\beta_0+\beta_1 x1_i),\]
So what do we do!?
We can “build” a function from the calculation of all \(e\)s in the sample, but leave two unknowns (intercept and slope):
\[\sum_{i=1}^N{e_i^2}=\sum_{i=1}^N{(y_i-\hat{y}_i)^2}=\sum_{i=1}^N{(y_i-[\beta_0+\beta_1 x1_i])^2}\]
Now, with a little bit of calculus and linear algebra, we can find the solution (the minimum of the function):
Check in question
Why do we square the residuals?
Don’t worry too much about the math; all you need to know is that there is an “easy” solution:
\[\hat{\beta}_1=\frac{COV(X,Y)}{VAR(X)} =\frac{\sum_{i=1}^N (x1_i - \bar{x})(y_i-\bar{y})}{\sum_{i=1}^N (x1_i - \bar{x})^2}=r \times \frac{SD(Y)}{SD(X)}\]
With \(\hat{\beta}_1\), we can solve for the intercept:
\[\hat{\beta}_0=\bar{y}-\hat{\beta}_1 \bar{x}\]
And, \(\sigma_e\), can be calculated as:
\[\hat{\sigma}_e = \sqrt{\frac{\sum (y_i-\hat{y}_i)^2}{N-2}}= \sqrt{\frac{\sum e^2}{df}}\]
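A quick sketch with simulated data (hypothetical x and y) confirms that these closed-form solutions reproduce what R’s `lm()` finds, including the \(r \times \frac{SD(Y)}{SD(X)}\) identity for the slope:

```r
set.seed(42)
x <- rnorm(150, mean = 10, sd = 2)
y <- 3 + 0.8 * x + rnorm(150, sd = 1.5)

# Closed-form OLS estimates
b1_hand <- cov(x, y) / var(x)            # slope
b0_hand <- mean(y) - b1_hand * mean(x)   # intercept

# Residual standard deviation, with df = N - 2
e       <- y - (b0_hand + b1_hand * x)
sigma_e <- sqrt(sum(e^2) / (length(y) - 2))

# Compare against lm()
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0_hand, b1_hand))  # TRUE
all.equal(sigma(fit), sigma_e)                     # TRUE
all.equal(b1_hand, cor(x, y) * sd(y) / sd(x))      # TRUE
```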
Check in Question
If both X and Y are standardized, what does \(\hat{\beta}_1\) equal?
Often researchers seek to test the statistical significance of the slope parameter and of the proportion of variability in the outcome it shares/explains.
Specifically, we test the following null and alternative hypotheses:
\[H_0:\beta_1=0\] \[H_1:\beta_1\neq0\]
Rejecting this null, \(H_0:\beta_1=0\), would indicate that the \(\hat{\beta}_1\) you found was “surprising” given the position of ignorance, \(\beta_1 = 0\); that is, the population relationship between X and Y is unlikely to be zero…
…These hypotheses are tested using a type of t test. Specifically, each parameter has its own standard error such that we can calculate a ratio of the effect (estimated parameter) over noise (the standard error) and get a p value associated with the t statistic.
\[t(df)=\frac{\hat{\beta}_1}{SE_{\hat{\beta}_1}}\]
with \(df = N − 2\) degrees of freedom, where \(\hat{\beta}_1\) is the estimated slope, \(SE_{\hat{\beta}_1}\) is its standard error, and N is the sample size
It is also easy to obtain CIs for the individual parameters using the following equation:
\[\hat{\beta}_1 \pm t_{cv}\times SE_{\beta_1}\]
where \(t_{cv}\) is the critical value for the t dist. with \(df\) degrees of freedom
The effect sizes of interest in SLR are the regression analogues of the effect sizes of interest in correlation.
We will revisit the research example from last week examining the relationship between impulsivity and aggression: 275 undergraduates completed a questionnaire assessing scores on the BPAQ and BIS scales among others.
Ultimately, the researcher is interested in predicting aggression from impulsivity
At a very simple level, the researcher wants to devise a model for aggression (operationalized with BPAQ scores) to explain or predict how and why people vary on this variable, given their BIS score.
Let’s revisit the example from last week to highlight more of a regression lens as opposed to correlation.
Load the data in R
Examine descriptive information about your variables, e.g., means, SDs, histograms/boxplots and scatterplots
Call:
lm(formula = BPAQ ~ BIS, data = agrsn)
Residuals:
Min 1Q Median 3Q Max
-1.14134 -0.30470 0.00845 0.35500 1.35527
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.5217 0.1973 7.713 2.31e-13 ***
BIS 0.4777 0.0854 5.594 5.39e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4972 on 273 degrees of freedom
Multiple R-squared: 0.1028, Adjusted R-squared: 0.09955
F-statistic: 31.29 on 1 and 273 DF, p-value: 5.391e-08
2.5 % 97.5 %
(Intercept) 1.1333249 1.9101360
BIS 0.3095758 0.6458088
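Those confint() bounds can be reproduced by hand from the printed slope and standard error (the tiny discrepancies are just rounding in the summary printout):

```r
b1 <- 0.4777  # slope estimate from the summary above
se <- 0.0854  # its standard error
df <- 273

t_cv <- qt(0.975, df)      # two-tailed critical value, ~1.97
b1 + c(-1, 1) * t_cv * se  # ~[0.3096, 0.6458], matching confint()
```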
“In our sample of \(N = 275\) undergraduate students, impulsivity (BIS) was found to predict aggression (BPAQ). For every 1-point increase on BIS, aggression (BPAQ) was predicted to increase by approximately 0.48 points (\(\hat{\beta}_1 = 0.48\), 95% CI \([0.31, 0.65]\)). This association was statistically significant (\(t(273) = 5.59\), \(p < 0.001\)). The narrow confidence interval suggests a rather precise estimate of the effect size. Furthermore, impulsivity explained about 10% of the variability in aggression (\(R^2 = 0.10\)), indicating a modest yet meaningful proportion.”